Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, and therefore rarely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls over the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being treated only as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify the background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To better preserve the background, we propose a novel training and sampling strategy that augments the diffusion U-Net with object-mask prediction. Lastly, we introduce a multi-task training strategy that jointly trains inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Data-efficient learning on graphs (GEL) is essential in real-world applications. Existing GEL methods focus on learning useful representations for nodes, edges, or entire graphs with "small" labeled data, but the problem of data-efficient learning for subgraph prediction has not been explored. The challenges of this problem lie in the following aspects: 1) It is crucial for subgraphs to learn positional features in order to acquire structural information in the base graph in which they exist. Although the existing subgraph neural network method is capable of learning disentangled position encodings, its overall computational complexity is very high. 2) Prevailing graph augmentation methods for GEL, including rule-based, sample-based, adaptive, and automated methods, are not suitable for augmenting subgraphs, because a subgraph contains fewer nodes but richer information such as position, neighbor, and structure. Subgraph augmentation is therefore more susceptible to undesirable perturbations. 3) Only a small number of nodes in the base graph are contained in subgraphs, which leads to a potential "bias" problem in which subgraph representation learning is dominated by these "hot" nodes. By contrast, the remaining nodes fail to be fully learned, which reduces the generalization ability of subgraph representation learning. In this paper, we aim to address the challenges above and propose a Position-Aware Data-Efficient Learning framework for subgraph neural networks called PADEL. Specifically, we propose a novel node position encoding method that is anchor-free, design a new generative subgraph augmentation method based on a diffused variational subgraph autoencoder, and propose exploratory and exploitable views for subgraph contrastive learning. Extensive experimental results on three real-world datasets show the superiority of our proposed method over state-of-the-art baselines.
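Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Treating image patches as tokens, transformers can model global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a result, it is difficult to scale these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens. Furthermore, we extend PPT to multi-view human pose estimation. Building on PPT, we propose a new cross-view fusion strategy, called human area fusion, which treats all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII show that our PPT can match the accuracy of previous pose transformer methods while reducing computation. Moreover, experiments on Human3.6M and Ski-Pose show that our multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.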
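The idea of restricting self-attention to a selected subset of tokens can be illustrated with a minimal sketch. This is not the PPT implementation: the per-token `scores` are assumed to be given (the abstract does not say how the rough human mask is computed), the attention uses the tokens themselves as queries, keys, and values with no learned projections, and all function names are hypothetical.

```python
import math


def select_tokens(scores, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def self_attention(tokens):
    """Plain scaled dot-product self-attention (assumption: Q = K = V = tokens)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        outputs.append([
            sum(weights[j] * tokens[j][c] for j in range(len(tokens))) / total
            for c in range(dim)
        ])
    return outputs


def pruned_attention(tokens, scores, keep):
    """Attend only among the kept tokens; pruned tokens are dropped entirely."""
    kept = select_tokens(scores, keep)
    return kept, self_attention([tokens[i] for i in kept])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
scores = [0.1, 0.9, 0.8, 0.2]  # hypothetical "is this patch on the person" scores
kept, attended = pruned_attention(tokens, scores, keep=2)
```

With N tokens and k kept tokens, the number of query-key pairs drops from N² to k², which is the source of the computational saving the abstract describes.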
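Large-scale undirected weighted networks are commonly found in big-data-related research fields. Such a network can naturally be quantified as a symmetric high-dimensional and incomplete (SHDI) matrix for implementing big data analysis tasks. A symmetric non-negative latent-factor-analysis (SNL) model is able to efficiently extract latent factors (LFs) from an SHDI matrix. However, it relies on a constrained training scheme, which makes it lack flexibility. To address this issue, this paper proposes an unconstrained symmetric non-negative latent-factor-analysis (USNL) model. Its main idea is two-fold: 1) the output LFs are separated from the decision parameters by integrating a non-negative mapping function into an SNL model; and 2) stochastic gradient descent (SGD) is used to implement unconstrained model training while ensuring the non-negativity of the output LFs. Empirical studies on four SHDI matrices generated from real big data applications show that a USNL model achieves higher prediction accuracy for missing data than an SNL model, as well as highly competitive computational efficiency.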
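The two-fold idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the USNL model itself: the abstract does not name the non-negative mapping function, so softplus is assumed here; the loss is plain squared error on observed entries with no regularization; and the function names and hyperparameters are hypothetical.

```python
import math
import random


def softplus(v):
    """Assumed non-negative mapping from a decision parameter to a latent factor."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def sigmoid(v):
    """Derivative of softplus, written to avoid overflow."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def train_sketch(observed, n, rank=2, lr=0.05, epochs=200, seed=0):
    """Unconstrained SGD on parameters y; returned factors x = softplus(y) >= 0."""
    rng = random.Random(seed)
    y = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n)]
    for _ in range(epochs):
        for (i, j), target in observed.items():
            xi = [softplus(v) for v in y[i]]
            xj = [softplus(v) for v in y[j]]
            err = sum(a * b for a, b in zip(xi, xj)) - target
            for k in range(rank):
                # Chain rule through the mapping: d(0.5 * err^2) / d y[i][k]
                grad_i = err * xj[k] * sigmoid(y[i][k])
                grad_j = err * xi[k] * sigmoid(y[j][k])
                y[i][k] -= lr * grad_i
                y[j][k] -= lr * grad_j
    return [[softplus(v) for v in row] for row in y]


def rmse(factors, observed):
    """Root-mean-square error of the symmetric reconstruction on observed entries."""
    total = 0.0
    for (i, j), target in observed.items():
        pred = sum(a * b for a, b in zip(factors[i], factors[j]))
        total += (pred - target) ** 2
    return math.sqrt(total / len(observed))


# Known upper-triangle entries of a small symmetric matrix (made-up toy data).
observed = {(0, 1): 1.0, (0, 2): 0.8, (1, 2): 0.5}
before = rmse(train_sketch(observed, n=3, epochs=0), observed)
factors = train_sketch(observed, n=3)
after = rmse(factors, observed)
```

Because the update acts on `y` rather than on the factors directly, no projection or clipping step is needed: non-negativity of the output factors follows from the mapping alone.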
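How would you repair a physical object with some missing parts? You might imagine its original shape from previously captured images, first recover its overall (global) but coarse shape, and then refine its local details. We are motivated to imitate this physical repair procedure to address point cloud completion. To this end, we propose a cross-modal shape-transfer dual-refinement network (termed CSDN), a coarse-to-fine paradigm in which images participate throughout the full cycle, for high-quality point cloud completion. CSDN mainly consists of "shape fusion" and "dual-refinement" modules to tackle the cross-modal challenge. The first module transfers intrinsic shape characteristics from a single image to guide the geometry generation of the missing regions of the point cloud, in which we propose IPAdaIN to embed the global features of both the image and the partial point cloud into completion. The second module refines the coarse output by adjusting the positions of the generated points, where the local refinement unit exploits the geometric relation between the novel and the input points through graph convolution, and the global constraint unit uses the input image to fine-tune the generated offsets. Unlike most existing approaches, CSDN not only explores the complementary information in images but also effectively exploits cross-modal data throughout the whole coarse-to-fine completion procedure. Experimental results show that CSDN performs favorably against ten competitors on the cross-modal benchmark.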
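Robots need multiple interaction modes to collaborate robustly with humans in complex industrial tasks. We develop a Coexistence-and-Cooperation (CoCo) human-robot collaboration system. Coexistence mode enables the robot to work with the human on different sub-tasks independently in a shared space. Cooperation mode enables the robot to follow human guidance and recover from failures. A human intention tracking algorithm takes both human and robot motion measurements as input and switches between the interaction modes. We demonstrate the effectiveness of the CoCo system in a use case analogous to a real-world multi-step assembly task.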
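Collaborative robots require effective human intention estimation to work safely and smoothly with humans in structured tasks such as industrial assembly, where human intention changes continuously. We propose the concept of intention tracking and introduce a collaborative robot system that concurrently tracks intentions at hierarchical levels. The high-level intention is tracked to estimate the human's interaction pattern and enables the robot to (1) avoid collision with the human to minimize interruption or (2) assist the human in correcting failures. The low-level intention estimate provides the robot with task-specific information for concurrent task execution. We implement the system on a UR5e robot and demonstrate robust, seamless, and ergonomic human-robot collaboration in an assembly use case through an ablative pilot study.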
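Segmenting 4K or 6K ultra-high-resolution images requires extra computational consideration in image segmentation. Common strategies, such as down-sampling, patch cropping, and cascade models, cannot properly address the balance between accuracy and computation cost. Motivated by the fact that humans distinguish among objects continuously from coarse to precise levels, we propose the Continuous Refinement Model (CRM) for the ultra-high-resolution segmentation refinement task. CRM continuously aligns the feature map with the refinement target and aggregates features to reconstruct the details of these images. In addition, our CRM shows significant generalization ability in bridging the resolution gap between low-resolution training images and ultra-high-resolution testing images. We present quantitative performance evaluation and visualization to show that our proposed method is fast and effective for image segmentation refinement. Code will be released at https://github.com/dvlab-research/entity.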
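We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector). Unlike recent advanced text detectors, which use hand-crafted network architectures and complicated post-processing and therefore suffer from low inference speed, FAST has two new designs. (1) We search for the network architecture by designing a network search space and reward function carefully tailored to text detection, leading to more powerful features than most networks searched for image classification. (2) We design a minimalist representation (with only a 1-channel output) to model text of arbitrary shape, together with GPU-parallel post-processing that assembles text lines efficiently with minimal time overhead. Benefiting from these two designs, FAST achieves an excellent trade-off between accuracy and efficiency on several challenging datasets. For example, FAST-A0 yields 81.4% F-measure at 152 FPS on Total-Text, outperforming the previous fastest method by 1.5 points and 70 FPS in terms of accuracy and speed. With TensorRT optimization, the inference speed can be further accelerated to over 600 FPS.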